Abstract

This paper presents a new method for detecting regions of a 
program where the memory hierarchy is performing poorly. By 
observing where actual measured execution time differs from the 
time predicted given a perfect memory system, we can isolate 
memory bottlenecks. MTOOL, an implementation of the approach
aimed at Fortran programs running on MIPS-chip based
workstations, is described, and results for some of the Perfect Club
and SPEC benchmarks are reported.
1 Introduction



Many modern computer architectures, including cache-based uniprocessors
and most shared memory multiprocessors, present the programmer with a
(deceptive) uniform access model of memory. RISC architectures, for example,
provide simple load and store instructions to represent memory operations.
In practice, however, the load instruction may involve accessing on-chip cache,
which is backed by a second level cache of static RAM, which is backed by
main memory. The time difference between a hit in the first level cache and
a miss to main memory can be one to two orders of magnitude (see Table 1).

Programmers seeking to improve performance sometimes find it useful to
optimize an algorithm with respect to a particular memory hierarchy. Studies
like [1], [2], [7] and [8] have reported speed-ups of 100% and more owing to
improved cache performance when nested loops are reordered and matrix
algorithms are blocked. To make use of such techniques, the user must
know where memory bottlenecks lie and when a transformation improves
performance.

Two techniques are typically employed to isolate memory bottlenecks.
The least time consuming approach is to statically analyze a program, using
dependency analysis to identify how many items will be in cache after a
certain number of iterations of a loop [3, 6]. Such techniques are important
because they are potentially fast enough to incorporate into compilers to
automatically manage transformations. However, the static techniques rely
on the simple structure of both the loop and the memory system to perform
their analyses. The approximate nature of analytic techniques and the
increasing complexity of the memory hierarchies they attempt to model make
them inappropriate as a complete performance debugging solution.

At the opposite end of the spectrum, there are trace driven simulators
which simulate the execution of every memory reference in a program. These
simulations can model the entire memory hierarchy and produce access time
estimates for every variable reference in a program. The obvious drawback of
simulation is its cost; for large programs, simulation may be quite expensive.
Also, it is non-trivial to correctly model a complicated memory hierarchy,
particularly when multiprogramming or multiprocessing is involved.

Our technique strikes a balance between the expense of simulation and
the inaccuracy of static analysis. Our key observation is that if we assume
memory access time is uniform then, at least for simpler architectures, it is
relatively cheap to correctly estimate the CPU execution time of a program.
By comparing this uniform access model estimate to the actual observed
execution time, we can isolate regions in a program where the memory hierarchy
performs poorly.

The next section discusses the method for estimating execution time
assuming constant memory access time. Section 3 describes how to use the
estimate to isolate memory bottlenecks. Section 4 presents a memory
bottleneck tool implementation, MTOOL, which runs on the DECstation 3100
and 5000. Section 5 provides examples of MTOOL's user interface. Section 6
reports results when the tool is run on the Perfect Club benchmarks and
scientific benchmarks in the SPEC set. Finally, section 7 gives some conclusions
about the breadth of applicability of our technique and suggests directions
for future research.

2 Estimating Execution Time

Consider a computer where all instruction scheduling is handled by software
(i.e., no hardware interlocks) and where each instruction (including memory
access instructions) has a known, fixed execution time. For such a computer,
we can determine the execution time of a program given instruction execution
counts using the formula,

execution time = sum over i of (# of times ith instruction executes) *
                 (time per execution of instruction i)
Let us try to apply a similar technique to a RISC architecture. First we
divide the program into basic blocks. A basic block is a group of instructions
with a unique entry point such that when the entry instruction executes, all
other instructions in the basic block will execute. It is possible to identify all
basic blocks in most executable programs by examining branch instruction
destinations and indirect jump tables. After determining the basic blocks, we
instrument the executable file by preceding each block with code to increment
a counter. Running the instrumented program produces a table of basic block
counts. A discussion of methods for instrumenting compiled code is provided
in the Appendix.

Using the counts and our knowledge of when hardware interlocks are
triggered, we can estimate how long each basic block executes. Our estimates
will have two shortcomings:

1. Memory access instructions do not execute in constant time.

2. There may be hardware interlocks across basic block boundaries.

The first shortcoming is actually the feature on which our bottleneck
detection technique is based. We assume all memory accesses take the
minimum possible time (typically the time for a primary cache hit) and when our
prediction disagrees with measured execution time we report a bottleneck.

The second weakness is not a problem for many RISC architectures
because there are few hardware interlocks and these interlocks rarely cross
basic block boundaries. On the MIPS processor where our experiments were
performed, the inter-block interlocks were negligible in real code. If such
interlocks occur with appreciable frequency, they can be estimated by
instrumenting to collect branch frequencies as well as basic block counts. The
branch frequencies tell us how often one basic block precedes another and
we can improve our estimate by including interlocks between adjacent basic
blocks.

Thus, we have a technique for estimating execution time of a whole
program. Moreover, our method costs only one instrumented program execution
rather than requiring a full, expensive machine simulation. The task now is
to use this estimation technique to isolate bottlenecks.
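
The estimate described in this section can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the block
names, cycle costs and counts below are invented, and each block's cost is
assumed to have been computed already with every memory access treated as
a primary cache hit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BasicBlock:
    """One basic block; all fields are illustrative, not MTOOL's format."""
    name: str
    cycles: int  # assumed cost of one execution with ideal memory


def estimate_cycles(blocks, counts):
    """Sum (execution count * cycles per execution) over all basic blocks."""
    return sum(counts.get(block.name, 0) * block.cycles for block in blocks)


# Hypothetical program: an entry block, a loop body run 1000 times, an exit.
blocks = [
    BasicBlock("entry", 4),
    BasicBlock("loop_body", 10),
    BasicBlock("exit", 2),
]
counts = {"entry": 1, "loop_body": 1000, "exit": 1}

estimated = estimate_cycles(blocks, counts)
print(estimated)
```

Dividing the cycle total by the clock rate would give an estimated time that
can be compared against a measured one.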
3 Isolating Bottlenecks



In this section, we develop a framework for detecting bottlenecks by
measuring divergence from predicted behavior. We begin by formalizing the notion
of measuring actual time spent in a region of code. A measurable object or
mobject is a set of instructions in which we can identify all entry and exit
points. The object is measurable because we can place start timer and stop
timer calls at these entry and exit points to measure the time spent in the
object. For example, we can time a procedure by placing a start_timer call at
the top of the procedure and stop_timer calls before every return statement.
Similarly, we can time a loop by placing a start_timer above the top of the
loop and a stop_timer below the bottom of the loop.
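
The placement of timer calls can be illustrated with a small Python sketch.
It is not MTOOL's instrumentation, which patches compiled code; the class
name MobjectTimer, the procedure timed_procedure and the mobject labels are
hypothetical, and the clock is injectable only so the example can be checked.

```python
import time


class MobjectTimer:
    """Accumulates elapsed time per mobject (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._started = {}
        self.elapsed = {}

    def start_timer(self, mobject):
        self._started[mobject] = self._clock()

    def stop_timer(self, mobject):
        begin = self._started.pop(mobject)
        spent = self._clock() - begin
        self.elapsed[mobject] = self.elapsed.get(mobject, 0.0) + spent


def timed_procedure(timer, values):
    timer.start_timer("timed_procedure")     # top of the procedure
    if not values:
        timer.stop_timer("timed_procedure")  # stop before the early return
        return 0
    total = 0
    timer.start_timer("sum_loop")            # above the top of the loop
    for value in values:
        total += value
    timer.stop_timer("sum_loop")             # below the bottom of the loop
    timer.stop_timer("timed_procedure")      # stop before the final return
    return total
```

Every exit path needs its own stop call, which is why an mobject requires
that all entry and exit points be identifiable.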

We say an mobject is timable if we can instrument the program to
measure the time spent within the object. A timable object must satisfy two
criteria:

1. The total time spent within the object substantially exceeds timer
granularity.

2. The perturbation created by the timer is not significant.

The first aspect of timability is rarely a problem as most systems provide at
least a 1/60th of a second timer, and mobjects of interest typically execute
for seconds, minutes, or even hours. The perturbation issue is more
difficult. To avoid changing memory performance, we require that the number
of memory operations performed by the mobject substantially exceed the
number performed in a clock timer call. In addition, to avoid appreciably
slowing the program down, we require that the time spent in the mobject
substantially exceed the time to make a clock call.

Using the above criteria, we can identify regions of the program whose
actual execution times can be measured. These execution times include,
however, not only the work done in an mobject proper, but also the work done
on behalf of the object by any procedures that it calls. In contrast, the basic
block counting estimation technique of the previous section calculates only
the work done in a basic block; it ignores the time spent in procedure calls.
Furthermore, while we can estimate the total time spent in a procedure q,
we cannot necessarily determine the time spent in q on behalf of a particular
caller. Thus, we cannot always estimate the time spent in and on behalf of an
mobject that calls q. We will say that the mobjects whose execution time
can be accurately predicted by basic-block counting techniques are estimable.

To identify estimable mobjects, we exploit information about the
structure of a program's call graph. In a call graph, nodes represent procedures
and there is a directed edge from node u to node w if procedure u calls w
during the execution of the program. The call graph has a distinguished
node, the root, which is the procedure where execution begins.

We say a node v dominates w if every path through the graph from the
root to w passes through v. The useful aspect of the call graph is that a
node is estimable if it dominates all of its descendants. Intuitively, if a node
v dominates w then all the work in w is done on behalf of v. Hence, if a
node dominates all of its children, then the estimated time for that node
and its descendants is just the sum of each of their estimated times, ignoring
procedure calls.
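
The dominance test can be made concrete with a short Python sketch. It is
one straightforward way to check the condition, not necessarily the
algorithm MTOOL uses: a node v dominates a descendant w exactly when w
cannot be reached from the root once v is removed from the graph. The call
graph and procedure names below are invented for illustration.

```python
def reachable(graph, start, blocked=None):
    """Nodes reachable from start without passing through blocked."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen or node == blocked:
            continue
        seen.add(node)
        stack.extend(graph.get(node, ()))
    return seen


def descendants(graph, v):
    """All nodes reachable from v by following one or more call edges."""
    found = set()
    for child in graph.get(v, ()):
        found |= reachable(graph, child)
    return found


def is_estimable(graph, root, v):
    """True when v dominates every one of its descendants."""
    if v == root:
        return True
    bypass = reachable(graph, root, blocked=v)  # paths that avoid v
    return descendants(graph, v).isdisjoint(bypass)


# Hypothetical call graph: norm is called from both setup and solve.
call_graph = {
    "main": ["setup", "solve", "report"],
    "setup": ["norm"],
    "solve": ["flux", "norm"],
    "flux": ["limiter"],
    "limiter": [],
    "norm": [],
    "report": [],
}

for name in call_graph:
    print(name, is_estimable(call_graph, "main", name))
```

In this example setup and solve are not estimable because norm does work on
behalf of both, while flux is estimable because limiter is reached only
through it.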

This observation serves as a working definition of estimability. The
definition implies that both the root of the call graph, which corresponds to the
execution of the whole program, and the leaves, which correspond to call-free
procedures, are estimable. Given this operational definition of estimability,
the primary issue in isolating memory bottlenecks now becomes one of
granularity of detection.

We could estimate the execution time of the full program and compare
this number against actual run time, but this will only describe the magnitude
of the memory effects, not localize them. Instead, our approach is to find a set
of smaller, timable mobjects containing the majority of memory operations,
and then to select members of this set that are estimable. The next section
outlines our algorithm.

4 The Implementation

This section describes one implementation of the estimable, timable mobject
approach to isolating memory bottlenecks, MTOOL. This specific
implementation is for Fortran programs running on MIPS-chip based workstations.
Fortran was chosen as a target language because large Fortran programs
often have memory bottlenecked regions and much of the research on alleviating
memory bottlenecks has concentrated on scientific code.

MTOOL seeks to isolate bottlenecks at the level of procedures and loops.
This decision reflects the fact that procedures and loops have natural meaning
to the user, satisfy the definition of mobject, and typically run long enough
to meet the timability criteria. The first step of the bottleneck isolation
process is to instrument the program of interest to collect basic block counts
and to run the instrumented code on a representative input. The basic
block counts are useful not only for estimating execution time; they also
provide MTOOL with precise knowledge of where memory operations are
concentrated.

MTOOL sorts the procedures by the number of memory operations they
execute and selects those that contain the first 95% of all memory operations.
For each of these procedures, MTOOL makes a list of loops and tries to
measure first the individual loops and then the whole procedure, subject to
timability and estimability constraints. Timability constraints are strongly
system dependent. DEC's ULTRIX provides a 1/60th of a second granularity
clock, but a system call is required to read the clock.
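
The 95% selection step at the start of this paragraph might look like the
following Python sketch. The procedure names and operation counts are
invented, and the exact tie-breaking and cut-off behaviour of MTOOL is not
described in the text, so the choices here are assumptions.

```python
def select_procedures(mem_ops, coverage=0.95):
    """Pick the heaviest procedures until they cover the requested
    fraction of all executed memory operations."""
    total = sum(mem_ops.values())
    selected = []
    covered = 0
    ranked = sorted(mem_ops.items(), key=lambda item: item[1], reverse=True)
    for name, ops in ranked:
        if covered >= coverage * total:
            break
        selected.append(name)
        covered += ops
    return selected


# Hypothetical per-procedure memory operation counts from basic block data.
mem_ops = {
    "solver": 600,
    "smooth": 250,
    "boundary": 110,
    "init": 30,
    "output": 10,
}

print(select_procedures(mem_ops))
```

Here the three heaviest procedures already cover 96% of the memory
operations, so init and output are never considered for timing.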

The overhead of the system call perturbs execution undesirably. The
cleanest solution would be to modify the operating system to create a clock
directly accessible by user processes, but a simpler, more widely applicable
option was to add an interval timer to create a clock in user memory. Start
and stop clock calls access the clock in user memory, which is updated by the
interrupts of the interval timer.

Using the user memory clock, MTOOL's timer has granularity of 1/60th
of a second and work per start/stop call of about 70 instructions. The
results reported in Section 6 seem to indicate this granularity and overhead
are acceptable. Also, we expect that interest in characterizing and improving
performance will drive architects and operating systems programmers to
provide better clocks in the future.

At this point, MTOOL has gathered a collection of timable loops and
procedures. The next step is to determine which of these objects is estimable.
Conceptually, MTOOL uses a simple depth first algorithm to label each
procedure in the call graph with two parameters:
... TDES(v) exactly when w is estimable. This relation formalizes the
observation of the previous section that a node is estimable when all the work of its
descendants is done on its behalf.

Thus, we have a test for estimability. Extending the test to loops simply
involves checking the analogous condition that:



In most cases, the loops and procedures selected by MTOOL immediately
satisfy the estimability condition. The condition is met frequently because
the objects selected by MTOOL contain the majority of memory operations,
which typically means they involve the most frequently executed portions of
code, which are normally leaves or objects all of whose children are leaves.

When the test is not satisfied immediately, several options are available.
One natural choice is to move up the call graph, as we know that eventually
we will encounter an estimable node, the root. This option is not ideal,
however, because it reduces the precision with which we localize bottlenecks.
A better choice is to check whether we can in fact accurately estimate the
time spent in a procedure call. Below, we describe three cases where we can
make an accurate estimate.

Many scientific library routines are simple, loop-free leaf procedures that
always follow the same execution path. Their estimated execution times are
consequently constant. We can often identify such procedures by checking
that they meet two conditions:

1. The procedure is loop-free and call-free.

2. The average time per call as determined by basic block counts is equal
to the maximum or minimum possible time per call.

The first condition implies the procedure is estimable and that we can
run shortest and longest path algorithms on the control flow graph of the
procedure to bound its execution time. Using these bounds, we can check the
second condition, which implies that the execution time per call is constant
because average equals extremum. This test finds that such common library
calls as SQRT() and EXP() have constant execution times when called in the
Perfect Club benchmarks.
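
The two-condition test can be sketched as follows in Python. The control
flow graph, block costs and counts are hypothetical, and the sketch assumes
the graph is acyclic (condition 1) and that costs are whole cycles, so the
comparison of the average against the bounds can be done on integer totals.

```python
from functools import lru_cache


def path_bounds(cfg, cost, entry):
    """Shortest and longest path cost from entry to any exit of an
    acyclic control flow graph (cfg maps block -> tuple of successors)."""

    @lru_cache(maxsize=None)
    def bounds(block):
        successors = cfg.get(block, ())
        if not successors:
            return cost[block], cost[block]
        lows, highs = zip(*(bounds(s) for s in successors))
        return cost[block] + min(lows), cost[block] + max(highs)

    return bounds(entry)


def has_constant_time(cfg, cost, entry, counts):
    """True when the average cost per call equals the minimum or maximum
    possible cost, so every call must have taken the same time."""
    calls = counts[entry]
    total = sum(counts[block] * cost[block] for block in cost)
    low, high = path_bounds(cfg, cost, entry)
    return total == low * calls or total == high * calls


# Hypothetical loop-free, call-free leaf with a fast and a slow path.
cfg = {"entry": ("fast", "slow"), "fast": ("exit",), "slow": ("exit",), "exit": ()}
cost = {"entry": 5, "fast": 10, "slow": 40, "exit": 3}

always_fast = {"entry": 100, "fast": 100, "slow": 0, "exit": 100}
mixed = {"entry": 100, "fast": 60, "slow": 40, "exit": 100}

print(path_bounds(cfg, cost, "entry"))
print(has_constant_time(cfg, cost, "entry", always_fast))
print(has_constant_time(cfg, cost, "entry", mixed))
```

When the average falls strictly between the two bounds, as in the mixed
case, the per-call time varied and the procedure cannot be treated as a
constant-cost call.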



1. Instrument executable program and collect basic block counts.

2. Select loops and procedures containing most memory operations.

3. Eliminate selected objects that fail to meet timability constraints.

4. Eliminate selected objects that fail to meet estimability constraints.

5. Instrument code to measure actual time spent in remaining selected
   objects.

6. Run instrumented code, correlate actual times with estimated
   times to isolate bottlenecks, and report bottlenecks to the user.

Figure 1: MTOOL's Bottleneck Isolation Algorithm



Two other simple heuristics suffice to eliminate many other problematic
calls. Suppose object u calls procedure p. Then, we can normally assume
average time per call from u to p is average time per call to p when either:

1. The vast majority (98%) of calls to p are made by u, or

2. (avg. time per call to p) * (# of calls from u to p) << time spent in u,
so any error in the approximation is negligible.

Both of these heuristics can be inaccurate under pathological conditions
(when the variance of the execution time of p across calls is large), so MTOOL
issues a warning whenever it invokes them. They have not caused problems
with the benchmarks measured in this study.
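
A minimal Python sketch of the two heuristics follows. The function name
and parameters are hypothetical, and the 1% threshold used to stand in for
"<<" is an assumption made for this example; the text gives only the 98%
figure for the first heuristic.

```python
def attribute_call_time(avg_time_p, calls_u_to_p, total_calls_p, time_in_u,
                        majority=0.98, negligible=0.01):
    """Estimate the time spent in p on behalf of u.

    Returns (estimate, heuristic name) when one of the two heuristics
    applies, or None when neither does. The negligible threshold is an
    assumed stand-in for the "<<" comparison.
    """
    estimate = avg_time_p * calls_u_to_p
    if calls_u_to_p >= majority * total_calls_p:
        return estimate, "majority caller"
    if estimate <= negligible * time_in_u:
        return estimate, "negligible contribution"
    return None


# u makes 990 of the 1000 calls to p: the first heuristic applies.
print(attribute_call_time(2.0, 990, 1000, time_in_u=50.0))

# u makes few, cheap calls to p: the second heuristic applies.
print(attribute_call_time(0.001, 10, 1000, time_in_u=50.0))

# Neither applies, so the time cannot safely be attributed to u.
print(attribute_call_time(1.0, 100, 1000, time_in_u=50.0))
```

A tool following the text would also emit a warning whenever either
heuristic is used, since both assume the per-call time of p varies little.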

The steps MTOOL uses to isolate memory bottlenecks are summarized
in Figure 1. Step 6 is of course the most significant to the user.



5 User Interface

The user view of MTOOL is considerably less complex than the algorithms
of the previous sections. The user types MTOOL program-name input-files
and waits while MTOOL instruments the program to collect basic block
counts, runs and times the instrumented code, and displays a result like that
shown in Figure 2. MTOOL generates the data in the figure by estimating
the run-time of the program and comparing it with the measured execution
time, appropriately adjusted for the effects of the counters. The information
prevents a user from proceeding when no significant memory bottlenecks
are present.

Predicted User Time: 95.0

Measured Time (compensated for counters): user 132.4 sys 0.7
Overhead estimates: User (memory) System (I/O)

39.3 0.7
Proceed with memory bottleneck probe insertion (y/n)?

Figure 2: MTOOL's Initial Bottleneck Estimate

Assuming the user asks MTOOL to produce an instrumented program,
MTOOL executes steps 2 to 5 of Figure 1, producing a file of mobject
descriptors and actual execution time measurements. This summary file is handed
off to the front end.

The front end's top level window is shown in Figure 3 for the Perfect Club
benchmark TFS.f, an air flow simulation. Overhead is defined as

(actual time - estimated time) / estimated time.
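
As a worked instance of this definition, the short Python sketch below
applies it to the predicted and measured user times shown in Figure 2.
Treating those two numbers as the estimated and actual times is an
assumption made for illustration, as is the function name.

```python
def overhead(actual_time, estimated_time):
    """Memory overhead relative to the ideal-memory estimate."""
    return (actual_time - estimated_time) / estimated_time


# Predicted user time 95.0 and measured user time 132.4, as in Figure 2.
ratio = overhead(132.4, 95.0)
print(f"{ratio:.3f}")
```

The ratio comes out near 0.39. Whether the 39.3 shown under "User (memory)"
in Figure 2 is this same quantity expressed as a percentage is not stated in
the text. An overhead above 1.0 means the program spends more time
waiting on memory than the ideal-memory estimate for the whole run.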

The histogram in the lower left of the window visually summarizes the data in
the upper right. The line "Measured 91% of memory operations" reports the
percentage of all executed memory operations that are included in measured
objects. All times and overheads refer to user time only; time spent in the
operating system on behalf of a program is ignored (though it was reported
at an earlier stage; see Figure 2).

By pressing the histogram button, the user can obtain a histogram where
bars represent total time spent on behalf of a procedure (Fig. 4). This time
is split into estimated time and measured memory overhead. The bars are
sorted in decreasing order of memory bottleneck magnitude. By clicking
the mouse on the memory overhead portion of a bar, the user opens a text
window (Fig. 5) displaying the procedure.
Measured bottlenecks within the procedure are displayed by
highlighting the text. Bottlenecks are highlighted one at a time, with the overall
overhead contribution and the extra cycles per memory operation associated
with the bottleneck reported at the top of the text window. Pressing the
INFO button opens a popup window that gives further information about
the distribution of memory operations when the measured object includes
loops or procedure calls. Figure 6 shows a simple INFO window for the top
bottleneck in dflux(). Pressing PREV or NEXT displays the previous or
next bottleneck; bottlenecks are sorted by magnitude. The text window may
be scrolled by pressing the vertical arrows at the right side of the window,
and multiple text windows may be open simultaneously.

Three aspects of the user interface deserve comment from an
implementor's perspective. First, the interface is relatively portable because it is built
using InterViews [4], an object oriented toolkit that runs on top of X. Second,
the notion of clicking on a bar to obtain further information on the
associated bottleneck is quite general. For example, it would be easy to support
an option where clicking on a procedure's estimated time bar would provide
information implicit in our time estimate like register usage, floating point
stalls, MFLOPS, most often executed lines, etc. A third note about the
interface is that it uses standard line number information available in the
object file to relate basic block level information back to source code lines.
In this sense, the user interface is language-independent.



6 Results 

The final measure of MTOOL and its user interface is how well they
isolate memory bottlenecks. Table 2 summarizes our results for a subset of the
Perfect Club benchmarks and the scientific benchmarks in the SPEC suite.
The column entitled "% of Memory Operations" indicates that our heuristics
for selecting timable, estimable mobjects succeed in choosing good potential
bottlenecks, as we measure over 90% of memory operations in all but two
cases, where we measure 85%. The column on unexplained overhead, which
reports the difference between measured overhead for the whole program and
the sum of measured overheads for individual objects, confirms this
conclusion, as our measured mobjects typically account for all but a few percent
of overhead. The least positive result is the data on number of source lines
contained in the mobjects. On half of the programs, the median object
contains fewer than 50 lines, but for some programs like fpppp, we are perhaps
failing to localize bottlenecks adequately. This problem is discussed further
in the Conclusions section below.

                                               # of Source Lines
           % of     Over-  Unexplained  In All   Median/  Max/   In All
Program    Mem Ops  head   Overhead     mobj.    mobj     mobj   Code
tfs        91%      40%    4%           252      8        53     2020
nas        97%      25%    0%           591      84       395    3976
sds        93%      6%     2%           167      12       120    7607
lws        93%      9%     0%           100      100      100    1237
lgs        98%      12%    0%           323      35       132    2327
fpppp      85%      50%    11%          940      274      666    2718
doduc      85%      29%    5%           2541     102      477    5334
spice2g6   95%      46%    2%           756      41       434    18411
dnasa7     97%      107%   1%           257      9        118    1105

Table 2: MTOOL Effectiveness

7 Conclusions

The results reported above support the general conclusion that our technique
succeeds in detecting memory bottlenecks in scientific Fortran programs. One
shortcoming of the current implementation is that it does not satisfactorily
localize the bottlenecks in a few of the programs (see Table 2). This problem
can certainly be addressed by modifying our algorithm to select smaller
mobjects. Currently, MTOOL searches for timable, estimable mobjects
starting with outer loops. Some benefit could be derived by trying inner loops
first. Similarly, MTOOL will time a whole loop-free procedure, regardless of
how many lines it contains. It would be a simple matter to add heuristics
to partition large procedures into multiple mobjects. This new heuristic
would benefit programs like fpppp in the SPEC set where a key bottleneck
is a 700 line loop-free procedure. Finally, we could add a new option where
MTOOL displays a procedure with its frequently executed memory references
highlighted whenever the MTOOL selected mobject exceeds a certain line
count threshold. The user could then manually create acceptably
proportioned mobjects containing these highlighted memory operations, subject
to MTOOL's checks for estimability and timability.

MTOOL could also be enhanced by broadening the class of programs for
which it can detect bottlenecks. MTOOL is now restricted to non-recursive
programs that do not use procedure variables. The restriction on recursion
derives primarily from the fact that the simple start/stop clock timers will not
work when an mobject can be re-entered (and the clock re-started) before
it is exited (and the clock is stopped). The restriction could be removed by
prohibiting the timing of recursive procedures or by using more complicated
timers that keep track of the depth of the recursive call and only start and
stop the clock on the first entry and last exit. Because MTOOL is language
independent (except for the restriction on recursion), modifying the timers
(and modifying the check for estimability of section 4 to account for loops in
the call graph) would allow MTOOL to handle languages with recursion.
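
The depth-tracking timer suggested above could be sketched as follows in
Python. This is an illustration of the idea, not part of MTOOL: the class
name DepthCountingTimer and the recursive factorial example are
hypothetical, and the clock is injectable only so the behaviour can be
checked.

```python
import time


class DepthCountingTimer:
    """Timer that reads the clock only on the first entry and last exit."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self._depth = 0
        self._begin = 0.0
        self.elapsed = 0.0

    def start_timer(self):
        if self._depth == 0:          # outermost entry starts the clock
            self._begin = self._clock()
        self._depth += 1

    def stop_timer(self):
        self._depth -= 1
        if self._depth == 0:          # outermost exit stops the clock
            self.elapsed += self._clock() - self._begin


def factorial(n, timer):
    """Recursive mobject: re-entered before it is exited."""
    timer.start_timer()
    result = 1 if n <= 1 else n * factorial(n - 1, timer)
    timer.stop_timer()
    return result


timer = DepthCountingTimer()
print(factorial(5, timer), timer.elapsed >= 0.0)
```

Nested entries only adjust the depth counter, so the time for the whole
recursive computation is counted once rather than restarted at each level.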

The restriction on procedure variables is also easy to remove. MTOOL's
definition of estimability requires a well-defined call graph, and the use of
procedure variables means that the full structure of the call graph is not determined
until run-time. By modifying the basic block counting instrumentation to
record a dynamic call graph, this restriction could be circumvented.

The final and perhaps most exciting application for MTOOL technology is
porting it to a shared memory multiprocessor. Memory bottlenecks on shared
memory machines can be severe and detecting them is difficult. Analytic
techniques cannot handle the complexity of parallel systems and simulating
multiple interacting processors is extremely complex and expensive. MTOOL
should provide a viable alternative to these approaches.

Acknowledgements: I would like to thank David Wall of DEC WRL who
sponsored part of this research during a summer internship and provided
invaluable advice on the ins and outs of patching code.

8 Appendix: Instrumenting Code

MTOOL requires the ability to modify an executable file in a non-intrusive
manner. One approach used by the Pixie [5] program from MIPS is to directly
patch an executable. A problem arises because some jumps are indirect; they
have the form JR r1 where r1 is a register containing an address. Pixie's
solution is to include an indirect jump table which maps every address in the
original executable to its corresponding address in the patched executable.
Indirect jumps are always made through this table. The drawback of Pixie
is that it can introduce non-negligible overhead.

A second approach is to patch the code at link time, modifying pc-relative
jumps, text addresses in the data segment, and the relocation dictionary
to correctly reflect changes in the executable file. The advantage of this
approach is that all jump destination addresses are clearly identifiable and
no overhead is added. See [9] for a complete description of the technique.

MTOOL uses a third approach. MTOOL patches the executable directly,
but it does not use an indirect jump table. Instead, MTOOL pattern matches
to find instructions which load addresses in the text segment. The loaded
text address is modified to represent the corresponding address in the
instrumented code. Thus, a JR r1 will succeed because the instructions to load r1
with a value have been correctly updated. This technique suffers from the
potential drawback that the pattern matcher could fail in highly optimized
code, but we have encountered no problems to date.